Failed autonomous investigations by customer - Low volume
Monitor - https://app.axiom.co/legion-security-ry4e/monitors/view/ic6O5drGw9hBjRNu2S
Overview
This monitor is designed to detect failures for customers with a low volume of autonomous runs. It evaluates performance over a broad time window and calculates the organization's overall success rate. If the success rate falls at or below the configured threshold, the monitor triggers an alert.
An alert from this monitor indicates that, for a relatively long period, a specific customer's success rate has remained at or below the defined threshold. This typically suggests a persistent issue rather than a short-term fluctuation.
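To make the alert condition concrete, here is a minimal Python sketch of the arithmetic. It assumes the cutoffs used in the monitor's query (at least 3 total runs, more than 1 failed run, success rate at or below 75%). The function and parameter names are hypothetical and exist only for illustration; the monitor itself runs as an Axiom query, not as this code.

```python
def success_percentage(total_runs: int, failed_runs: int) -> float:
    """Success rate in percent, mirroring (1 - failed / total) * 100."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return (1.0 - failed_runs / total_runs) * 100


def should_alert(
    total_runs: int,
    failed_runs: int,
    min_total_runs: int = 3,       # assumed from `where TotalRuns >= 3`
    failed_runs_floor: int = 1,    # assumed from `where FailedRuns > 1`
    threshold_pct: float = 75.0,   # assumed from `where SuccessPercentage <= 75`
) -> bool:
    """Return True when an org's numbers would satisfy all three filters."""
    if total_runs < min_total_runs:
        return False  # too few runs to judge this customer
    if failed_runs <= failed_runs_floor:
        return False  # a single failure is not enough to alert
    return success_percentage(total_runs, failed_runs) <= threshold_pct


if __name__ == "__main__":
    # 2 failures out of 8 runs is exactly 75%, which is "at or below" the threshold.
    print(should_alert(total_runs=8, failed_runs=2))
    # 1 failure out of 4 runs is also 75%, but the failure count is too low.
    print(should_alert(total_runs=4, failed_runs=1))
```

Note that under these assumptions a customer at exactly the threshold still alerts, while a customer with only one failed run never does, regardless of the percentage.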
Finding the problem
- Run the monitor's query to list the customers whose success rate is at or below the defined threshold:
logs
| parse tostring(body) with "Finished investigation " investigation_id " in org " org_id
| where isnotempty(investigation_id) and isnotempty(org_id)
| where ['resource.deployment.environment'] == "production"
| summarize TotalRuns = count() by investigation_id, org_name
| join kind=leftouter (
    logs
    | parse tostring(body) with "Investigation '" investigation_id "' ending. Status: " status ", Reason: " reason
    | where isnotempty(investigation_id) and isnotempty(status) and isnotempty(reason) and status == "failed"
    | where ['resource.deployment.environment'] == "production"
    | summarize FailedRuns = count() by investigation_id, org_name
) on investigation_id, org_name
| extend FailedRuns = iff(isnull(FailedRuns), 0, FailedRuns)
| summarize
    TotalRuns = sum(TotalRuns),
    FailedRuns = sum(FailedRuns),
    SuccessRuns = sum(TotalRuns - FailedRuns),
    SuccessPercentage = (1.0 - (1.0 * sum(FailedRuns) / sum(TotalRuns))) * 100
    by org_name
| where TotalRuns >= 3
| where FailedRuns > 1
| summarize SuccessPercentage=avg(SuccessPercentage) by org_name
| where SuccessPercentage <= 75
- Open the session viewer, filter by the customer name, and set the audit status to Failed and the session type to Autonomous. Then go through each failed investigation. In most cases, the session details are enough to understand the issue and fix it.
- If the session viewer does not contain enough data, check the Axiom telemetry for traces and logs that indicate failures in the failed sessions. Here are a few queries to start with:
traces
| where ['attributes.investigation_id'] == "<SESSION_ID>" and ['status.code'] == "ERROR"

logs
| where ['attributes.investigation_id'] == "<SESSION_ID>" and severity == "error"